THREAD SCHEDULING ON MULTIPROCESSOR SYSTEMS

BACKGROUND 
1. FIELD 

This disclosure relates generally to multi-thread applications on
multi-processor systems, and, more specifically but not exclusively, to thread
scheduling on a multi-processor system.



2. DESCRIPTION 

A threaded application usually has data shared among its threads when
running on symmetric multiprocessors ("SMP") and/or chip multiprocessors
("CMP"). The data sharing among different threads may be achieved in different
ways but is frequently done through a shared system-level memory. In a typical
memory hierarchy in a multiprocessor system, a shared system-level memory
between different processing cores has longer access latency for a processing
core than a local cache of the processing core. Additionally, traffic (including
coherency traffic) among different processing cores generated by excessive
access to a shared system-level memory may saturate the bandwidth of the
system interconnect (e.g., bus, ring, mesh, etc.). Therefore, it is desirable to
investigate the data sharing behavior among different threads and reduce the
cost of data transfer among threads.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the disclosed subject matter will become 
apparent from the following detailed description of the subject matter in which: 

Figure 1 is a block diagram of an example multiprocessor system that
uses a data sharing aware thread scheduling module, according to the disclosed
subject matter in the present application;

Figure 2 is a block diagram of another example multiprocessor system 
that uses a data sharing aware thread scheduling module, according to the 
disclosed subject matter in the present application; 

Figure 3 shows a block diagram of a data sharing aware thread
scheduling module, according to the disclosed subject matter in the present
application;

Figure 4 is a flowchart illustrating an example process for scheduling
threads to target processors using information of data sharing behavior among
different threads, according to the disclosed subject matter in the present
application;

Figure 5 illustrates an example program that can be multithreaded in a 
multiprocessor system; 

Figure 6 illustrates a code corresponding to the program shown in Figure
5, which is multithreaded for a multiprocessor system;

Figure 7 illustrates an example assembly code of the multi-threaded 
program shown in Figure 6; 

Figure 8 illustrates an example assembly code of the multi-threaded
program shown in Figure 6, which uses a data sharing aware thread scheduling 
method, according to the disclosed subject matter in the present application; 

Figure 9 illustrates an example code of a multi-threaded video mining
program, according to the disclosed subject matter in the present application;
and

Figure 10 illustrates performance improvement of a multithreaded
application running on a multiprocessor system, obtained by using a data sharing
aware thread scheduling module, according to the disclosed subject matter in the
present application.

DETAILED DESCRIPTION 

According to embodiments of the subject matter disclosed in this
application, a compiler in a multiprocessor system may compile a received
multithreaded application, analyze data sharing behavior among multiple threads
of the multithreaded application, and provide such information to a thread
scheduler in the multiprocessor system. At run time, the thread scheduler may
group together threads that share data frequently based on the data sharing
information provided by the compiler, and schedule threads in the same group to
processors in the same cluster. Processors in the same cluster have a shared
storage device and have shorter access latency to the shared storage device
than to the system-level memory. If there are not enough available processors in
the same cluster, the rest of the threads may be assigned to processors that are
electronically in proximity to the cluster. Additionally, a feedback module may
collect information on data sharing behavior among threads during run time and
feed such information back to the thread scheduler. The thread scheduler may
use the feedback information to regroup and reschedule the threads to
processors at the next available scheduling time.

Reference in the specification to "one embodiment" or "an embodiment" of
the disclosed subject matter means that a particular feature, structure or
characteristic described in connection with the embodiment is included in at least
one embodiment of the disclosed subject matter. Thus, appearances of the
phrase "in one embodiment" in various places throughout the
specification do not necessarily all refer to the same embodiment.

Figure 1 is a block diagram of an example multiprocessor system 100 that
uses a data sharing aware thread scheduling module. System 100 may
comprise multiple processors such as processor A (132A). Processors in
system 100 may be connected to each other using a system interconnect 110.
System interconnect 110 may be a Front Side Bus (FSB). Each processor may
be connected to Input/Output (IO) devices as well as system-level memory 120
through the system interconnect. The system-level memory may be a Dynamic
Random Access Memory (DRAM), a Synchronous DRAM (SDRAM), a Double
Data Rate (DDR) SDRAM, or other types of memory devices. Each processor
(e.g., processor A (132A)) may have its own cache with one or more levels (e.g.,
level 1 (L1) and level 2 (L2) caches, not shown in the figure). In one
embodiment, several processors (e.g., processor A (132A) through processor M
(132M)) may have a shared local-level storage device such as cache 138.
Cache 138 may be connected with processor 132A through processor 132M via
a connection mechanism 136. Connection mechanism 136 may be a local
interconnect (e.g., local bus), a connection ring, or a cross-bar.

Processors 132A through 132M, connection mechanism 136, and cache
138 together may form a cluster (cluster 1 (130)). Cluster 1 may be connected
with other clusters (e.g., cluster N (140)), individual processors (not shown in the
figure), IO devices and system-level memory 120, and other devices via system
interconnect 110. In one embodiment, an individual processor (not shown in the
figure) may have multiple processing cores. Each processing core may have its
own cache and all of the processing cores together may have a shared local
cache. Such a processor with multiple processing cores may be treated similarly
to a cluster (e.g., cluster 1). For the convenience of description, a processing
core inside a processor may be treated in the same way as a single-core
processor. Thus, the word "processor" may be used to represent either a
single-core processor or a processing core in a multi-core processor.

A data sharing aware thread scheduling module (e.g., 134A, 134M) may
be included in a processor (such as processor 132A or 132M). The thread
scheduling module may comprise a compiler, a thread scheduler, and a
feedback module. The compiler may receive a multithreaded application and
compile the application into one or more object code files specific to the
underlying architecture. The thread scheduler may be a part of, or associated
with, an operating system (OS). The thread scheduler schedules multiple
threads of the multithreaded application to different processors. The feedback
module may observe actual data sharing behavior among threads during
execution and provide the actual data sharing information to the thread
scheduler, which will use the feedback information to fine-tune its thread
scheduling at the next scheduling time.

When two or more threads share data frequently, such data sharing is
often achieved through shared system-level memory (e.g., memory 120).
A processor (e.g., processor 132A) experiences longer latency when accessing
a system-level memory than when accessing a local-level memory (e.g., cache
138). Additionally, frequent access to a system-level memory consumes
considerable bandwidth of the system interconnect (e.g., 110) and may even cause
congestion on the system interconnect. Furthermore, coherence traffic
generated by access to a system-level memory typically has high overhead.

A conventional thread scheduler has little knowledge of data
sharing behavior among multiple threads, and when it schedules multiple
threads, it mainly focuses on memory hierarchy optimization. According to one
embodiment of the subject matter in the present application, the thread
scheduler may utilize information on data sharing behavior among threads
provided by the compiler, which is able to identify such information when
compiling a multithreaded application. Additionally, the thread scheduler is
able to determine the underlying architecture of the system, for example, which
processors are in the same cluster and which clusters are in proximity to each
other electronically. Clusters that are in proximity to each other electronically
include such clusters where data sharing among their processors requires less
time, causes less traffic on the system interconnect, and/or incurs less
coherence overhead, than data sharing among processors from other clusters.
Such clusters may include those that are close to each other in topology. With
data sharing information from the compiler plus knowledge of underlying
architectural characteristics, the thread scheduler (e.g., 134A) may schedule
those tightly coupled threads (with frequent data sharing among them) to
processors in the same cluster. If there are not enough available processors in
any cluster, the thread scheduler may schedule those tightly coupled threads to
processors in those clusters that are in proximity to each other electronically.

Figure 2 is a block diagram of another example multiprocessor system
200 that uses a data sharing aware thread scheduling module. In system 200,
system interconnect 210 that connects multiple clusters (e.g., 220A, 220B, 220C,
and 220D) is a links-based point-to-point connection. Each cluster may connect
to the system interconnect through a links hub (e.g., 230A, 230B, 230C, and
230D). In some embodiments, a links hub may be co-located with a memory
controller, which coordinates traffic to/from a system memory (not shown in the
figure). Similar to cluster 1 (130) shown in Figure 1, clusters 220A, 220B, 220C,
and 220D may comprise two or more processors (or processing cores), which
are connected with a shared local storage device (e.g., last level cache). Each
processor may have its own local caches.

A processor in a cluster may have a data sharing aware thread scheduling
module which may comprise a compiler, a thread scheduler, and a feedback
module. The compiler receives a multithreaded application and compiles it into
object code specific to the underlying hardware. The thread scheduler may
schedule multiple threads of the multithreaded application to various processors.
The feedback module may provide actual data sharing information among
threads during execution to the thread scheduler, which will use such information
to fine-tune thread scheduling at the next scheduling time.

According to one embodiment of the subject matter disclosed in the
present application, the compiler may provide the thread scheduler with data
sharing information among threads. The thread scheduler may also obtain the
underlying hardware configuration from the OS or a hardware abstraction layer
(e.g., virtual machine). Based on the data sharing information and the hardware
structural information, the thread scheduler may assign tightly coupled threads to
those processors which are in proximity to each other electronically, for example,
processors in the same cluster, or processors in clusters that are in proximity to
each other electronically if there are not enough available processors in one
cluster.

Figure 3 shows a block diagram of a data sharing aware thread
scheduling module 300 according to the subject matter disclosed in the present
application. The module may comprise a compiler 320, a thread scheduler 330,
and a feedback module 350. For convenience of description, the
figure also includes a multithreaded application 310 and target processors 340.
Compiler 320 may receive a multithreaded application and compile the
application into one or more object code files specific to the underlying
architecture. Additionally, the compiler analyzes data sharing behavior among
different threads and obtains data sharing information before the threads are
executed on target processors. The compiler then provides such data sharing
information to the thread scheduler.

Thread scheduler 330 may be a part of, or associated with, an operating
system (OS). Based on the data sharing information provided by the compiler,
the thread scheduler may divide all of the threads into different groups. Threads
within the same group share data with each other frequently. In one
embodiment, the compiler may group all the threads into different groups based
on their data sharing behavior obtained during compilation and pass the
grouping information to the thread scheduler. Additionally, the thread scheduler
is capable of obtaining structural characteristics of the underlying architecture of
a multiprocessor system such as, for example, which processors are in the same
cluster and which clusters are in proximity to each other electronically. If the OS
has the knowledge of hardware structural features, the thread scheduler may
obtain such hardware structural information from the OS. If the OS does not
have direct knowledge of hardware structural features (e.g., in a situation where
a virtual machine exists between the OS and the underlying hardware), the OS
or the thread scheduler may still get such hardware information by invoking
certain application program interfaces (APIs).

With the hardware information, the thread scheduler may assign threads
in the same group to processors in the same cluster. If there are not enough
processors available in any cluster, the thread scheduler may identify a cluster
that has the highest number of available processors, assign a corresponding
number of threads in a group to processors in the cluster, and assign the rest of
threads in the group to processors in another cluster that is in proximity to the
cluster electronically. In case the cluster that has the highest number of
available processors cannot host all of the threads in a group and does not have
any cluster in proximity electronically that has enough available processors that
can host the rest of the threads in the group, the thread scheduler may assign
threads in the group to clusters that are in proximity to each other electronically
and that together have enough processors available to host all of the threads in
the group.
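
The assignment policy just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `schedule_group`, the `free` table of available processors per cluster, and the `distance` table of pairwise sharing cost between clusters are all hypothetical. The sketch also simplifies the last case above by spilling leftover threads greedily to the clusters nearest to the starting cluster.

```python
def schedule_group(group, free, distance):
    """Assign each thread in one group to a cluster.

    group: list of thread identifiers that share data frequently.
    free: dict mapping cluster id -> number of available processors
          (updated in place as processors are taken).
    distance: dict mapping (cluster_a, cluster_b) -> hypothetical cost of
              sharing data between the two clusters; lower means the
              clusters are closer to each other electronically.
    Returns a dict mapping thread id -> cluster id.
    """
    if len(group) > sum(free.values()):
        raise ValueError("not enough available processors for the group")

    # Prefer a single cluster that can host the whole group.
    fitting = [c for c in sorted(free) if free[c] >= len(group)]
    if fitting:
        order = [fitting[0]]
    else:
        # Otherwise start from the cluster with the most available
        # processors and spill over to the clusters nearest to it.
        home = max(sorted(free), key=lambda c: free[c])
        others = sorted((c for c in free if c != home),
                        key=lambda c: (distance[(home, c)], c))
        order = [home] + others

    placement = {}
    remaining = list(group)
    for cluster in order:
        while remaining and free[cluster] > 0:
            placement[remaining.pop(0)] = cluster
            free[cluster] -= 1
    return placement


# Illustrative data: three clusters, none large enough for four threads.
free_processors = {"A": 2, "B": 3, "C": 1}
cost = {("A", "B"): 1, ("B", "A"): 1, ("B", "C"): 2,
        ("C", "B"): 2, ("A", "C"): 3, ("C", "A"): 3}
placement = schedule_group(["t0", "t1", "t2", "t3"], free_processors, cost)
print(placement)
```

In this example no cluster can host all four threads, so three threads go to cluster B (the cluster with the most available processors) and the fourth goes to cluster A, which has the lowest assumed sharing cost with B.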

Feedback module 350 may observe actual data sharing behavior among
threads during execution and provide this actual data sharing information to the
thread scheduler. The thread scheduler may use such feedback information on
actual data sharing behavior to regroup threads of the multithreaded application
and reschedule the threads to target processors according to the regrouping
result, at the next available thread scheduling time. The thread scheduler may
first examine whether the feedback information supports regrouping and
rescheduling. For example, if the actual data sharing behavior during execution
does not deviate from the current grouping significantly, the thread scheduler
may decide not to regroup or reschedule the threads. In one embodiment, the
thread scheduler may decide whether to regroup and reschedule threads of the
application based on the execution status of the application. For example, if the
application is close to completion when the feedback information regarding
actual data sharing behavior is received and supports regrouping and
rescheduling, the thread scheduler may choose not to regroup or reschedule
threads at the next scheduling time.
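
The two checks described above (whether observed behavior deviates significantly from the current grouping, and whether the application is close to completion) can be sketched as a single decision function. The deviation measure used here, the fraction of thread pairs whose same-group status differs, and both threshold values are illustrative assumptions; the disclosure does not specify how deviation is measured.

```python
from itertools import combinations


def same_group_pairs(groups):
    """Return the set of thread pairs that are placed in the same group."""
    pairs = set()
    for g in groups:
        pairs.update(combinations(sorted(g), 2))
    return pairs


def should_regroup(current, observed, progress,
                   min_deviation=0.25, max_progress=0.9):
    """Decide whether to regroup at the next scheduling time.

    current: grouping in use, as a list of lists of thread ids.
    observed: grouping implied by the feedback module's observations.
    progress: fraction of the application already completed (0.0 to 1.0).
    min_deviation, max_progress: illustrative thresholds (assumptions).
    """
    if progress >= max_progress:
        return False  # application is close to completion
    cur, obs = same_group_pairs(current), same_group_pairs(observed)
    union = cur | obs
    if not union:
        return False  # no pair is grouped either way; nothing to change
    deviation = len(cur ^ obs) / len(union)
    return deviation >= min_deviation


current_groups = [[0, 1], [2, 3]]
observed_groups = [[0, 2], [1, 3]]
print(should_regroup(current_groups, observed_groups, progress=0.3))
print(should_regroup(current_groups, observed_groups, progress=0.95))
```

With these inputs every grouped pair differs between the two groupings, so the first call favors regrouping, while the second call declines because the application is nearly finished.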

Figure 4 is a flowchart illustrating an example process 400 for scheduling
threads to target processors using information of data sharing behavior among
different threads, according to the subject matter disclosed in the present
application. At block 410, a multithreaded application may be received and
passed to a compiler. At block 420, the multithreaded application may be
compiled by the compiler. While compiling the application, the compiler may
analyze data sharing behavior among threads of the application and collect data
sharing information at block 430. Also at block 430, the data sharing information
may be provided to a thread scheduler by the compiler. At block 440, threads
that share data frequently, in other words, threads that are tightly coupled, may
be identified based on the data sharing information provided by the compiler.
All of the threads of the application may be placed into different groups, with
those tightly coupled threads being in the same group. In one embodiment, the
compiler may group all of the threads of the application using the data sharing
information obtained during compilation and then pass the grouping information
to the thread scheduler.
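
As a minimal sketch of the grouping step at block 440, the following Python example assumes that the data sharing information takes the form of pairwise sharing counts and that a pair is "tightly coupled" when its count meets a threshold. The data format, the threshold, and the name `group_threads` are illustrative assumptions, not details given in this disclosure. Threads linked by tightly coupled pairs are merged into one group with a union-find structure.

```python
from collections import defaultdict


def group_threads(thread_ids, sharing, threshold):
    """Group threads whose pairwise sharing count meets the threshold.

    thread_ids: iterable of hashable thread identifiers.
    sharing: dict mapping (thread_a, thread_b) pairs to a sharing count
             (hypothetical format for the compiler-provided information).
    threshold: minimum count for a pair to be treated as tightly coupled
               (illustrative assumption).
    """
    parent = {t: t for t in thread_ids}

    def find(t):
        # Follow parent links to the group representative, halving the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), count in sharing.items():
        if count >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for t in parent:
        groups[find(t)].append(t)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


sharing_info = {(0, 1): 120, (1, 2): 95, (3, 4): 80, (2, 3): 4}
thread_groups = group_threads(range(6), sharing_info, threshold=50)
print(thread_groups)
```

Here threads 0, 1, and 2 end up in one group and threads 3 and 4 in another, while the weak link between threads 2 and 3 is ignored and thread 5, which shares nothing, forms a group of its own.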

At block 450, threads in each group may be scheduled to processors that
are in proximity to each other electronically. If there is any cluster that has
enough available processors, threads in a group will be scheduled to processors
in the same cluster; otherwise, threads will be scheduled to processors in the
minimum number of clusters that are in proximity to each other electronically. At
block 460, threads that have been assigned may be executed on target
processors. At block 470, a feedback module may observe actual data sharing
behavior during execution on the target processors. At block 480, such actual
data sharing information may be provided to the thread scheduler. Upon
receiving such information, the thread scheduler will decide whether
rescheduling of threads is necessary at the next scheduling time. If it is, the
thread scheduler may regroup threads or fine-tune the previous grouping and
reschedule the threads based on the regrouping or fine-tuning at the next
available time.

Figure 5 illustrates an example program 500 that can be multithreaded in
a multiprocessor system. This is a simple pointer chasing application.
Depending on the size of the list p, it may take a long time to complete execution
of this simple application. Thus, it may be necessary to multithread this
application and run multiple threads in parallel. There are several ways to
achieve parallel implementation of an application in multiple threads. One way is
to use OpenMP, a parallel programming model, which has been supported by
many compilers, such as Intel® C++ and Fortran compilers. The OpenMP
programming model provides a work-queue execution model with task-queue
and task primitives to allow users to efficiently exploit parallelism among irregular
patterns of dynamic data structures and/or those with complicated control
structures such as recursion. In the following description, the OpenMP
programming model is used as an example to illustrate how a data sharing
aware thread scheduling scheme, as disclosed in this application, works.

Figure 6 illustrates a code 600 corresponding to the program shown in
Figure 5, which is multithreaded for a multiprocessor system using the OpenMP
programming model. The main differences between code 600 and code 500 are
two added lines (line 615 and line 635) in code 600, which parallelize the pointer
chasing application shown in Figure 5.
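
The figures are not reproduced in this text. As a rough analogue of the work-queue pattern they describe, the following Python sketch walks a linked list sequentially and queues one task per node on a thread pool. It uses Python's standard `concurrent.futures.ThreadPoolExecutor` rather than OpenMP, and the names `Node`, `process`, and `chase_parallel` are hypothetical; it should not be read as the code of Figure 5 or Figure 6.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One element of a singly linked list."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def process(node):
    # Placeholder for the per-node work; here it squares the value.
    return node.value * node.value


def chase_parallel(head, num_threads=4):
    """Walk the list sequentially and queue one task per node."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = []
        p = head
        while p is not None:
            futures.append(pool.submit(process, p))
            p = p.next
        # Collect results in list order.
        return [f.result() for f in futures]


# Build the list 1 -> 2 -> 3 -> 4 -> 5 and process it.
head = None
for v in reversed(range(1, 6)):
    head = Node(v, head)
results = chase_parallel(head)
print(results)
```

The traversal itself stays sequential, since each node is reached only through its predecessor; only the per-node work runs in parallel, which is the property that makes a task queue a natural fit for pointer chasing.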

Figure 7 illustrates an example assembly code 700 of the multi-threaded
program shown in Figure 6. Assembly code 700 is obtained by compiling code
600. Lines 750 through 770 are assembly translations for the master thread.
Code 700 illustrates how thread scheduling is done without considering data
sharing behavior among threads. Line 710 creates a team of threads. The
compiler may use data sharing information among threads obtained during
compilation to create a team of threads by placing those tightly coupled threads
into the same team. However, assembly code 700 does not provide the thread
scheduler with such data sharing information among threads. Thus, when the
thread scheduler schedules a team of threads, it does not necessarily assign
threads in the same team to processors in the same cluster. In other words, even
if the data sharing information is considered in creating a team of threads,
without informing the thread scheduler of such consideration, the resulting
scheduling will not be based on the data sharing information obtained by the
compiler.

Figure 8 illustrates an example assembly code 800 of the multi-threaded
program shown in Figure 6, which uses a data sharing aware thread scheduling
method, according to the subject matter disclosed in the present application.
The main difference between code 800 and code 700 is line 815, which is not in
code 700. In code 800, line 810 creates a team of threads with consideration of
data sharing information among threads obtained by the compiler. Line 815 then
informs the thread scheduler that threads in the same team should be scheduled
to processors in the same cluster or in clusters that are in proximity to each other
electronically.

Figure 9 illustrates an example code 900 of a multi-threaded video mining
program, according to the subject matter disclosed in the present application.
Code 900 is a sample from a video mining application. This application uses a
task-queue to build a pipeline based parallel version. Within one task-queue,
one thread is responsible for video decoding, and other worker threads perform
feature extraction from decoded frames. In this application, the number of
threads is obtained by invoking the function omp_get_num_threads(). Following
the disclosed data sharing aware scheduling scheme, the compiler can easily
add the scheduling hint information (e.g., line 815 in Figure 8) and the thread
scheduler can then utilize this information to achieve efficient thread scheduling.
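
The pipeline structure described above, with one decoding thread feeding several feature extraction threads through a task queue, can be sketched in Python as follows. This is an illustrative analogue built on the standard `queue` and `threading` modules, not the code of Figure 9; the name `run_pipeline` and the stand-in "decode" and "extract" steps are assumptions made for the example.

```python
import queue
import threading


def run_pipeline(frames, num_workers=3):
    """One decoder thread feeds frames to worker threads through a queue."""
    tasks = queue.Queue()
    features = {}
    lock = threading.Lock()

    def decoder():
        for index, frame in enumerate(frames):
            tasks.put((index, frame))      # stand-in for video decoding
        for _ in range(num_workers):
            tasks.put(None)                # one stop marker per worker

    def worker():
        while True:
            item = tasks.get()
            if item is None:
                break
            index, frame = item
            value = sum(frame)             # stand-in for feature extraction
            with lock:
                features[index] = value

    threads = [threading.Thread(target=decoder)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [features[i] for i in range(len(frames))]


extracted = run_pipeline([[1, 2], [3, 4], [5, 6]])
print(extracted)
```

Every frame produced by the decoder is consumed by a worker, so the decoder and its workers share data continuously. Threads with this relationship are the kind that the disclosed scheme would place in the same team and schedule to the same cluster.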

Figure 10 illustrates a performance improvement of a multithreaded
application running on a multiprocessor system, obtained by using a data sharing
aware thread scheduling module, according to the disclosed subject matter in the
present application. An experiment was conducted by using a 32-way shared
memory multiprocessor system to study the effectiveness of the disclosed data
sharing aware thread scheduling scheme. Each processor is equipped with an
8KB L1 cache, a 512KB L2 cache, and a 4MB L3 cache. Each cluster has 4
processors that share a 32MB L4 cache. The workload used for the study is a
video mining application and a hybrid parallel scheme is employed, which
exploits both data and task level parallelism to parallelize the application. With
the hint information of data sharing provided by the compiler, the thread
scheduler allocates threads in a team to processors in the same cluster (in this
particular experiment, a team of threads has 4 threads).

For comparison purposes, two different cases are studied. In case 2, the
disclosed data sharing aware thread scheduling scheme is used. In case 1, closely
coupled threads, among which there is a lot of data sharing, are intentionally
allocated to processors scattered in different clusters. Performance of each
case is compared with the default scheduling scheme, which allocates
threads without considering data sharing information among the threads. As
shown in Figure 10, the x-axis shows the number of processors which are
actually used in the multiple processor system; the y-axis shows the
performance ratio between each of the two studied cases and the default
scheduling scheme. Experiment results clearly show that performance in case 1
is consistently below performance in the default case for different numbers of
processors. However, performance in case 2, which uses the disclosed data
sharing aware thread scheduling scheme, is significantly better than
performance in the default case for any number of processors studied in the
experiment.

Although an example embodiment of the disclosed subject matter is
described with reference to block and flow diagrams in Figures 1-10, persons of
ordinary skill in the art will readily appreciate that many other methods of
implementing the disclosed subject matter may alternatively be used. For
example, the order of execution of the blocks in flow diagrams may be changed,
and/or some of the blocks in block/flow diagrams described may be changed,
eliminated, or combined.

In the preceding description, various aspects of the disclosed subject
matter have been described. For purposes of explanation, specific numbers,
systems and configurations were set forth in order to provide a thorough
understanding of the subject matter. However, it is apparent to one skilled in the
art having the benefit of this disclosure that the subject matter may be practiced
without the specific details. In other instances, well-known features,
components, or modules were omitted, simplified, combined, or split in order not
to obscure the disclosed subject matter.

Various embodiments of the disclosed subject matter may be
implemented in hardware, firmware, software, or a combination thereof, and may
be described by reference to or in conjunction with program code, such as
instructions, functions, procedures, data structures, logic, application programs,
design representations or formats for simulation, emulation, and fabrication of a
design, which when accessed by a machine results in the machine performing
tasks, defining abstract data types or low-level hardware contexts, or producing a
result.

For simulations, program code may represent hardware using a hardware
description language or another functional description language which 
essentially provides a model of how designed hardware is expected to perform. 
Program code may be assembly or machine language, or data that may be 
compiled and/or interpreted. Furthermore, it is common in the art to speak of 
software, in one form or another as taking an action or causing a result. Such 
expressions are merely a shorthand way of stating execution of program code by 
a processing system which causes a processor to perform an action or produce 
a result. 

Program code may be stored in, for example, volatile and/or non-volatile 
memory, such as storage devices and/or an associated machine readable or 
machine accessible medium including solid-state memory, hard-drives, floppy- 
disks, optical storage, tapes, flash memory, memory sticks, digital video disks, 
digital versatile discs (DVDs), etc., as well as more exotic mediums such as 
machine-accessible biological state preserving storage. A machine readable 
medium may include any mechanism for storing, transmitting, or receiving 
information in a form readable by a machine, and the medium may include a 
tangible medium through which electrical, optical, acoustical or other form of 
propagated signals or carrier wave encoding the program code may pass, such 
as antennas, optical fibers, communications interfaces, etc. Program code may 
be transmitted in the form of packets, serial data, parallel data, propagated 
signals, etc., and may be used in a compressed or encrypted format. 

Program code may be implemented in programs executing on
programmable machines such as mobile or stationary computers, personal
digital assistants, set top boxes, cellular telephones and pagers, and other
electronic devices, each including a processor, volatile and/or non-volatile
memory readable by the processor, at least one input device and/or one or more
output devices. Program code may be applied to the data entered using the
input device to perform the described embodiments and to generate output
information. The output information may be applied to one or more output
devices. One of ordinary skill in the art may appreciate that embodiments of the
disclosed subject matter can be practiced with various computer system
configurations, including multiprocessor or multiple-core processor systems,
minicomputers, mainframe computers, as well as pervasive or miniature
computers or processors that may be embedded into virtually any device.
Embodiments of the disclosed subject matter can also be practiced in distributed
computing environments where tasks may be performed by remote processing
devices that are linked through a communications network.

Although operations may be described as a sequential process, some of
the operations may in fact be performed in parallel, concurrently, and/or in a
distributed environment, and with program code stored locally and/or remotely
for access by single or multi-processor machines. In addition, in some
embodiments the order of operations may be rearranged without departing from
the spirit of the disclosed subject matter. Program code may be used by or in
conjunction with embedded controllers.

While the disclosed subject matter has been described with reference to
illustrative embodiments, this description is not intended to be construed in a
limiting sense. Various modifications of the illustrative embodiments, as well as
other embodiments of the subject matter, which are apparent to persons skilled
in the art to which the disclosed subject matter pertains are deemed to lie within
the scope of the disclosed subject matter.

CLAIMS



What is claimed is: 

1. A method for thread scheduling for a multithreaded application in a
multiprocessor system, comprising: 

obtaining information on data sharing behavior among multiple threads of 
said multithreaded application; 

grouping said multiple threads into at least one group based at least in 
part on said information on data sharing behavior among said multiple threads; 
and 

scheduling said group of threads to target processors, said target 
processors being in proximity to each other electronically. 

2. The method of claim 1, wherein obtaining information on data sharing
behavior comprises: 

compiling said multithreaded application; 

analyzing data sharing behavior among said multiple threads; and 

collecting said information on data sharing behavior. 

3. The method of claim 1, wherein grouping said multiple threads
comprises: 

identifying tightly coupled threads; and 

placing said identified threads into the same group. 

4. The method of claim 1, wherein scheduling said group of threads to
target processors comprises:

assigning threads in the same group to said target processors that are in
a cluster; and

if there are not enough available processors in said cluster, assigning the
rest of the threads to processors that are electronically in proximity to said cluster.

1 5. The method of claim 1 , further comprising: 

2 executing said multiple threads of said multithreaded application on said 

3 multiprocessor system; and 

4 observing data sharing behavior among said multiple threads during 

5 execution. 

1 6. The method of claim 5, further comprising: 

2 providing feedback of said data sharing behavior to regroup said multiple 

3 threads; and 

4 rescheduling, at the next scheduling time, said regrouped multiple threads 

5 by assigning threads in the same group to target processors that are 

6 electronically in proximity to each other. 

1 7. The method of claim 1 , wherein grouping said multiple threads further 

2 comprises regrouping said multiple threads based on feedback of data sharing 

3 behavior among said multiple threads during execution. 

8. An article comprising a machine-readable medium that contains
instructions, which when executed by a processing platform, cause said
processing platform to perform operations comprising:

obtaining information on data sharing behavior among multiple threads of
said multithreaded application;

grouping said multiple threads into at least one group based at least in
part on said information on data sharing behavior among said multiple threads;
and

scheduling said group of threads to target processors, said target
processors being in proximity to each other electronically.

9. The article of claim 8, wherein obtaining information on data sharing
behavior comprises:

compiling said multithreaded application;

analyzing data sharing behavior among said multiple threads; and

collecting said information on data sharing behavior.

10. The article of claim 8, wherein grouping said multiple threads
comprises:

identifying tightly coupled threads; and

placing said identified threads into the same group.

11. The article of claim 8, wherein scheduling said group of threads to
target processors comprises:

assigning threads in the same group to said target processors that are in
a cluster; and

if there are not enough available processors in said cluster, assigning the
rest of the threads to processors that are electronically in proximity to said
cluster.

12. The article of claim 8, wherein said operations further comprise:

executing said multiple threads of said multithreaded application on said
multiprocessor system;

observing data sharing behavior among said multiple threads during
execution;

providing feedback of said data sharing behavior to regroup said multiple
threads; and

rescheduling, at the next scheduling time, said regrouped multiple threads
by assigning threads in the same group to target processors that are
electronically in proximity to each other.

13. The article of claim 8, wherein grouping said multiple threads further
comprises regrouping said multiple threads based on feedback of data sharing
behavior among said multiple threads during execution.

14. An apparatus for thread scheduling for a multithreaded application on
a multiprocessor system, comprising:

a compiler to compile said multithreaded application, to analyze data
sharing behavior among multiple threads of said multithreaded application, and
to obtain information on data sharing behavior among said multiple threads; and

a thread scheduler to receive said information on data sharing behavior
among said multiple threads and to schedule said multiple threads to target
processors based at least in part on said information on data sharing behavior.

15. The apparatus of claim 14, wherein said thread scheduler groups
tightly coupled threads into the same group and assigns said threads in the
same group to processors in a cluster, and if there are not enough available
processors in said cluster, assigns the rest of said threads to processors that are
electronically in proximity to said cluster.

16. The apparatus of claim 14, further comprising a plurality of processors
to receive and execute said scheduled threads.

17. The apparatus of claim 16, further comprising a feedback module to
observe data sharing behavior among said scheduled threads during execution
and to provide feedback on said data sharing behavior during execution to said
thread scheduler.

18. The apparatus of claim 17, wherein said thread scheduler regroups
and reschedules, at the next scheduling time, said multiple threads to target
processors based on said feedback on said data sharing behavior during
execution.

19. A multiprocessor system, comprising:

a synchronous dynamic random access memory ("SDRAM");

a plurality of processors coupled to access said SDRAM via a system
interconnect, said plurality of processors forming at least one cluster, wherein
processors in the same cluster have a shared storage device;

at least one compiler in one or more of said plurality of processors to
receive and compile a multithreaded application, to analyze data sharing
behavior among multiple threads of said multithreaded application, and to obtain
information on data sharing behavior among said multiple threads; and

at least one thread scheduler in one or more of said plurality of processors
to receive said information on data sharing behavior among said multiple threads
from said compiler, and to schedule said multiple threads to said plurality of
processors based at least in part on said information on data sharing behavior.

20. The system of claim 19, wherein said thread scheduler groups tightly
coupled threads into the same group and assigns said threads in the same
group to processors in a cluster, and if there are not enough available
processors in said cluster, assigns the rest of said threads to processors that are
electronically in proximity to said cluster.

21. The system of claim 19, further comprising a feedback module to
observe data sharing behavior among said scheduled threads during execution
and to provide feedback on said data sharing behavior during execution to said
thread scheduler.

22. The system of claim 21, wherein said thread scheduler regroups and
reschedules, at the next scheduling time, said multiple threads to said plurality of
processors based on said feedback on said data sharing behavior during
execution.

ABSTRACT OF THE DISCLOSURE

According to embodiments of the subject matter disclosed in this
application, a compiler in a multiprocessor system may compile a received
multithreaded application, analyze data sharing behavior among multiple threads
of the multithreaded application, and provide such information to a thread
scheduler in the multiprocessor system. Threads that share data frequently may
be grouped together based on the data sharing information provided by the
compiler, and at run time, the thread scheduler may schedule threads in the
same group to processors that are in proximity to each other electronically.
Additionally, a feedback module may collect information on data sharing
behavior among threads during execution and feed such information back to the
thread scheduler. The thread scheduler may use the feedback information to
regroup and reschedule the threads at the next available scheduling time.
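
By way of illustration only, and not as a description of any disclosed embodiment, the grouping and scheduling summarized above might be sketched as follows. The sketch assumes that the data sharing information arrives as a matrix of pairwise sharing counts, that a fixed threshold decides which threads are tightly coupled, and that processors with neighboring identifiers are electronically close, so that a group which overflows one cluster spills into the adjacent one. The names group_threads, schedule_groups, cluster_of and MAX_THREADS are hypothetical and do not appear in the disclosure.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* Union-find lookup with path halving. */
static int find_root(int *parent, int t)
{
    while (parent[t] != t) {
        parent[t] = parent[parent[t]];
        t = parent[t];
    }
    return t;
}

/*
 * Assumption: sharing is an n x n row-major matrix of pairwise sharing
 * counts. Any pair at or above threshold is treated as tightly coupled
 * and merged. On return, group[t] is the smallest thread index in the
 * group that contains thread t.
 */
void group_threads(const int *sharing, int n, int threshold, int *group)
{
    assert(n > 0 && n <= MAX_THREADS);
    for (int t = 0; t < n; t++)
        group[t] = t;
    for (int a = 0; a < n; a++) {
        for (int b = a + 1; b < n; b++) {
            if (sharing[a * n + b] >= threshold) {
                int ra = find_root(group, a);
                int rb = find_root(group, b);
                if (ra < rb)
                    group[rb] = ra;
                else if (rb < ra)
                    group[ra] = rb;
            }
        }
    }
    for (int t = 0; t < n; t++)
        group[t] = find_root(group, t);
}

/* Assumption: cluster c owns processors c*cluster_size .. c*cluster_size + cluster_size - 1. */
int cluster_of(int proc, int cluster_size)
{
    return proc / cluster_size;
}

/*
 * Give the threads of each group consecutive processor ids, one thread
 * per processor. A group that does not fit in the rest of a cluster
 * continues on the neighboring cluster. Returns 0 on success, or -1 if
 * there are fewer processors than threads.
 */
int schedule_groups(const int *group, int n, int num_procs, int *proc_of)
{
    assert(n > 0 && n <= MAX_THREADS);
    if (n > num_procs)
        return -1;
    int next = 0;
    for (int g = 0; g < n; g++)          /* group ids are thread indices */
        for (int t = 0; t < n; t++)
            if (group[t] == g)
                proc_of[t] = next++;
    return 0;
}
```

For example, with four threads in which threads 0 and 2 share heavily and threads 1 and 3 share heavily, a call to group_threads followed by schedule_groups with a cluster size of two would be expected to place each pair within a single cluster. Feedback gathered during execution could be applied by updating the sharing matrix and calling both functions again at the next scheduling time.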

APPLICATION

COMPILE THE APPLICATION

SHARING INFORMATION TO THREAD SCHEDULER

GROUP THREADS USING DATA SHARING INFORMATION

SCHEDULE THREADS TO TARGET PROCESSORS USING GROUPING INFORMATION

EXECUTE THREADS ON TARGET PROCESSORS

OBSERVE DATA SHARING BEHAVIOR DURING EXECUTION

PROVIDE FEEDBACK ON THE DATA SHARING BEHAVIOR TO THREAD SCHEDULER

500

510  void test(List p)
520  {
530      while (p != NULL)
540      {
550          do_work(p);
560          p = p->next;
570      }
580  }

FIG. 5

600

605  void test(List p)
610  {
615      #pragma intel omp parallel taskq shared(p) num_threads(4)
620      {
625          while (p != NULL)
630          {
635              #pragma intel omp task captureprivate(p)
640              {
645                  do_work(p);
650              }
655              p = p->next;
660          }
665      }
670  }

FIG. 6
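
For readers whose compilers do not accept the vendor-specific taskq pragmas of FIG. 6, the same loop might be written with the standard OpenMP task construct roughly as follows. This is a sketch under stated assumptions rather than part of the disclosure: the List node layout, the body of do_work, and the total accumulator are hypothetical, and firstprivate(p) is used here as the approximate counterpart of captureprivate(p).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the figures do not define List. */
typedef struct node {
    int value;
    struct node *next;
} *List;

static long total;   /* illustrative result accumulated by do_work */

/* Placeholder work item; the figures leave do_work unspecified. */
static void do_work(List p)
{
    assert(p != NULL);
    #pragma omp atomic
    total += p->value;
}

void test(List p)
{
    /* One thread walks the list and creates tasks; the team executes them. */
    #pragma omp parallel num_threads(4)
    #pragma omp single
    {
        while (p != NULL) {
            /* firstprivate gives each task its own copy of the current p */
            #pragma omp task firstprivate(p)
            do_work(p);
            p = p->next;
        }
    }   /* implicit barrier: all tasks have completed at this point */
}
```

When compiled without OpenMP support the pragmas are ignored and the loop runs serially with the same result, which makes the sketch straightforward to check.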